ABSTRACT

In many applications today, user interaction is moving away from mouse
and pen and is becoming pervasive and much more physical and tangible.
Emerging interaction technologies allow developing and experimenting
with new interaction methods on the long way to intuitive
human-computer interaction. In this paper, we aim at recognizing
gestures to interact with an application and present the design and
evaluation of our sensor-based gesture recognition. As input device we
employ the Wii controller (Wiimote), which recently gained much
attention worldwide. We use the Wiimote's acceleration sensor
independently of the gaming console for gesture recognition. The system
allows users to train arbitrary gestures, which can then be recalled
for interacting with systems such as photo browsing on a home TV. The
developed library exploits Wii sensor data and employs a hidden Markov
model for training and recognizing user-chosen gestures. Our evaluation
shows that we can already recognize gestures with a small number of
training samples. In addition to the gesture recognition, we also
present our experiences with the Wii controller and the implementation
of the gesture recognition. The system forms the basis for our ongoing
work on multimodal intuitive media browsing and is available to other
researchers in the field.


Author Keywords 
tangible user interfaces, gesture recognition, Wiimote 


ACM Classification Keywords 
[H.5.2 User Interfaces]: Haptic I/O 


INTRODUCTION 

In recent years, we find more and more affordable hardware that allows
the development of multimodal user interfaces. One recent example is
the so-called Wiimote [1], the device that serves as the wireless input
for the Nintendo Wii gaming console. The Wiimote can detect motion and
rotation in three dimensions through the use of accelerometer
technology. When the controller is separated from the gaming console,
the accelerometer data can be used as input for gesture recognition. In
our work, we address the recognition of gestures for new multimodal
user interfaces. We are interested in recognizing arbitrary gestures
that users perform with one hand. We chose the Wiimote as our input
device for its ease of use, its price and its design.

Figure 1. The Wii Controller (Wiimote).


Accelerometer-based gesture recognition has been discussed in many
publications, most prominently by Hofmann et al. in [4] and most
recently by Mantyjarvi et al. in [6] and [7]. Like the commercial work
by AiLive Inc. (cf. [2]), we aim for a system that allows the training
and recognition of arbitrary gestures using an accelerometer-based
controller. In doing so we have to deal with patterns that vary
spatially as well as temporally and thus need a theoretical backbone
fulfilling these demands. We transfer the methods proposed in [6,7],
which use special hardware for 2D gesture recognition, to the consumer
hardware of the Wiimote and recognize 3D hand gestures. With the
controller the user can make her own, closed gestures, and our gesture
recognition aims at a Wii-optimized recognition. Our components as well
as the filtering process are specifically targeted at the Wiimote. With
this paper we also share our experiments and the resulting
implementation with other researchers.


CONCEPT 

In gesture recognition using an acceleration sensor, gestures are
represented by characteristic patterns of incoming signal data, i.e.
vectors representing the current acceleration of the controller in all
three dimensions. Hence, we need a system pipeline that prepares and
analyzes this vector data in order to train as well as recognize
patterns for distinct gestures. For this purpose we resort to the
classic recognition pipeline shown in Figure 2. It consists of three
main components: quantizer, model and classifier.

Figure 2. Components of our recognition system. We use a total of two
filters before following a traditional pipeline like [7]. The quantizer
applies a common k-means algorithm to the incoming vector data, the
model is a left-to-right hidden Markov model, and the classifier is a
Bayesian classifier.


As an accelerometer constantly produces vector data, we first need a
quantizer that clusters the gesture data. Here, a common k-means
algorithm (cf. e.g. [5]) is applied. The model has been chosen to be a
discrete hidden Markov model (HMM), since it has a long history in
gesture recognition and promises to deliver reliable results for
patterns with spatial and temporal variation (cf. e.g. [4]). The
remaining component is a classic Bayes classifier. In addition to these
main components we establish two filters for pre-processing the vector
data, an “idle state” and a “directorial equivalence” filter. Both
serve to reduce and simplify the incoming acceleration data.
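
The role of the Bayes classifier at the end of the pipeline can be illustrated with a minimal sketch. It assumes that each trained gesture model reports a likelihood P(O | gesture) for an observed code sequence O and that a prior P(gesture) is known. The function name `classify` and the dictionary layout are hypothetical and chosen for illustration only; they are not the interface of our library.

```python
from typing import Dict, Tuple


def classify(likelihoods: Dict[str, float],
             priors: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Return the most probable gesture and all posteriors P(gesture | O).

    likelihoods[g] is P(O | g) as reported by the model of gesture g and
    priors[g] is P(g). Both dictionaries are hypothetical inputs.
    """
    # Joint probability P(O | g) * P(g) for every known gesture.
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    evidence = sum(joint.values())  # P(O), the normalising constant
    if evidence == 0.0:
        raise ValueError("observation is impossible under every model")
    posteriors = {g: p / evidence for g, p in joint.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors
```

With equal priors the decision reduces to picking the model with the highest likelihood; unequal priors shift the decision towards gestures that are expected more often.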


As we want to optimize the HMM for accelerometer-based gesture
recognition, we use the reference gestures shown in Figure 3 in the
following tests and evaluations. With regard to the components of the
classic gesture recognition approach in Figure 2, we identify three
components for analysis and improvement: vector quantization, the
concrete hidden Markov model and the filters.


Vector quantization 

Like other acceleration sensors, the one integrated into the Wiimote
delivers too much vector data to be put into a single HMM. In order to
cluster and abstract this data, the common k-means algorithm is
applied, with k being the number of clusters or codes in the so-called
codebook. Since k must be determined empirically, we conducted tests to
find a codebook size that delivers satisfying results. As we are
evaluating true 3D gestures, we cannot rely on previous results by
Mantyjarvi et al., who empirically identified k = 8 for gestures in a
two-dimensional plane. However, we adopt their idea of arranging the 8
cluster centres on a circle and extend it to the 3D case. Instead of
distributing the centres uniformly on a two-dimensional circle, we put
them on a three-dimensional sphere, intersecting two circles orthogonal
to each other (cf. Figure 4). The two circles share two centres, so
this leads to k = 8 + 6 = 14 centres. For comparison, we also enhanced
the spherical distribution to include another four centres on the
XZ-plane and thus gain k = 18 cluster centres. The radius of each
circle/sphere dynamically adapts itself to the incoming signal data.

(a) Square (b) Circle (c) Roll (d) Z (e) Tennis

Figure 3. Reference Gestures. The gesture in (b) does not show a
starting point because the gesture might start anywhere on the circle.
Gesture (c) describes a 90° roll around the z-axis (forth and back) and
gesture (e) symbolizes the serve of a regular tennis match: raising the
controller and then rapidly lowering it in a bow-curved manner.

Figure 4. Distribution of the cluster centres during quantization for
k ∈ {8, 14, 18}. We extrapolate from the two-dimensional case for k = 8
with a simple circular distribution to a three-dimensional one using
two orthogonal circles for k = 14 and to another three-dimensional one
using three orthogonal circles for k = 18, and evaluate which of them
results in the most reliable behavior.
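
The three centre layouts can be reproduced with a short sketch. It assumes eight equally spaced points per circle, that the second circle lies in the YZ-plane and the third in the XZ-plane, and that centres falling on a shared axis are counted once. The function names and the fixed `radius` parameter are hypothetical; in particular, the dynamic adaptation of the radius to the signal is not modelled here.

```python
import math
from typing import List, Tuple

Vector = Tuple[float, float, float]


def circle_points(plane: str, radius: float, n: int = 8) -> List[Vector]:
    """n equally spaced points on a circle in the 'xy', 'yz' or 'xz' plane."""
    points = []
    for i in range(n):
        angle = 2.0 * math.pi * i / n
        # Rounding removes floating-point noise so that shared points
        # compare equal; adding 0.0 turns a possible -0.0 into 0.0.
        u = round(radius * math.cos(angle), 9) + 0.0
        v = round(radius * math.sin(angle), 9) + 0.0
        if plane == "xy":
            points.append((u, v, 0.0))
        elif plane == "yz":
            points.append((0.0, u, v))
        elif plane == "xz":
            points.append((u, 0.0, v))
        else:
            raise ValueError("unknown plane: " + plane)
    return points


def initial_centres(k: int, radius: float = 1.0) -> List[Vector]:
    """Initial codebook for k in {8, 14, 18}; shared points are kept once."""
    planes = {8: ["xy"], 14: ["xy", "yz"], 18: ["xy", "yz", "xz"]}
    if k not in planes:
        raise ValueError("k must be 8, 14 or 18")
    centres: List[Vector] = []
    for plane in planes[k]:
        for point in circle_points(plane, radius):
            if point not in centres:
                centres.append(point)
    return centres
```

Under these assumptions the second circle contributes six new centres because two of its points lie on the y-axis, and the third circle contributes only its four diagonal points, which matches the counts 8, 14 and 18.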


We conducted a small evaluation comparing the three settings shown in
Figure 4 using the reference gestures from Figure 3. We found that for
k = 8 the recognition process cannot clearly differentiate between the
five reference gestures. Since the gestures explore all three
dimensions, laying out the centres on a two-dimensional plane is not
sufficient. With k = 14 the probabilities for the respective gestures
improve as expected and the model can clearly distinguish between the
five gestures. Using k = 18 results in “over-trained” HMMs, does not
improve the probabilities and slows down performance. Consequently we
choose k = 14 with the distribution shown in Figure 4(b).
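
How a codebook turns acceleration vectors into discrete HMM symbols can be sketched with a plain k-means (Lloyd) iteration. This is a minimal illustration under assumptions, not our implementation: the function names, the fixed iteration limit and the rule that an empty cluster keeps its centre are hypothetical choices.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest(v: Vector, centres: Sequence[Vector]) -> int:
    """Index of the centre closest to v (squared Euclidean distance)."""
    return min(range(len(centres)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centres[i])))


def kmeans(vectors: Sequence[Vector], centres: Sequence[Vector],
           iterations: int = 10) -> List[Vector]:
    """Refine the given initial centres with Lloyd's k-means iterations."""
    centres = list(centres)
    for _ in range(iterations):
        # Assignment step: group every vector with its nearest centre.
        groups: List[List[Vector]] = [[] for _ in centres]
        for v in vectors:
            groups[nearest(v, centres)].append(v)
        # Update step: move each centre to the mean of its group;
        # a centre without assigned vectors stays where it is.
        updated = []
        for centre, group in zip(centres, groups):
            if group:
                updated.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                updated.append(centre)
        if updated == centres:  # converged
            break
        centres = updated
    return centres


def quantize(vectors: Sequence[Vector], centres: Sequence[Vector]) -> List[int]:
    """Map every vector to the index (code) of its nearest centre."""
    return [nearest(v, centres) for v in vectors]
```

The list returned by `quantize` is the discrete observation sequence that a discrete HMM can be trained on or evaluated against.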


Hidden Markov Model 

In our system an HMM is initialized for every gesture and then
optimized by the Baum-Welch algorithm (cf. [3]). However, there are two
competing HMM instances we might resort to: a left-to-right and an
ergodic model. While [4] claims that both approaches deliver comparable
results, [9] states that a left-to-right model is clearly to be
preferred when the incoming signals change over time. We implemented
both models and ran a test to determine which model better suits our
needs. Table 1 shows the results for both possible instances and a
varying number of states. Our results confirm the statement by [4] that
no instance is significantly better than the other, as well as the
statement by [8] that the influence of the number of states is rather
weak. In the end we chose a left-to-right HMM with 8 states for
convenience.

Table 1. Model probabilities for left-to-right and ergodic HMM with
varying number of states. Our evaluation confirms the statement by [4]
that neither the number of states nor the concrete HMM instance
influences the results all too much.
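
The structure of a left-to-right model can be made concrete with a small sketch. It builds uniform starting parameters for a discrete HMM with 8 states and 14 symbols and evaluates a code sequence with the forward algorithm. The Baum-Welch re-estimation itself is not shown, and the names `left_to_right_hmm`, `forward_probability` and the parameter `max_jump` are assumptions made for illustration, not the initialization used in our system.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def left_to_right_hmm(n_states: int = 8, n_symbols: int = 14,
                      max_jump: int = 1) -> Tuple[List[float], Matrix, Matrix]:
    """Uniform starting parameters (pi, A, B) for a left-to-right HMM.

    From state i the model may stay in i or advance by at most max_jump
    states; moving backwards has probability zero.
    """
    pi = [1.0] + [0.0] * (n_states - 1)  # always start in the first state
    A = []
    for i in range(n_states):
        last = min(i + max_jump, n_states - 1)
        width = last - i + 1
        A.append([1.0 / width if i <= j <= last else 0.0
                  for j in range(n_states)])
    # Every state initially emits every symbol with equal probability.
    B = [[1.0 / n_symbols] * n_symbols for _ in range(n_states)]
    return pi, A, B


def forward_probability(obs: Sequence[int], pi: List[float],
                        A: Matrix, B: Matrix) -> float:
    """P(obs | model) for a non-empty code sequence (forward algorithm)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
                 for j in range(n)]
    return sum(alpha)
```

An ergodic model differs only in the transition matrix, where every entry is non-zero. For long sequences the forward variables become very small, so a practical implementation would scale them or work with logarithms.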


Filtering 

Before the actual recognition process our system applies two filters to
the vector data, establishing a minimal representation of a gesture
before it is forwarded to the HMM for training or recognition. The
first filter is a simple threshold filter eliminating all vectors which
do not contribute to the characteristic of a gesture in a significant
way, i.e. all $\vec{a}$ for which $|\vec{a}| < \Delta$. We call this
filter the “idle state filter” and set $\Delta = 1.2g$, g being the
acceleration of gravity. The second filter is called the “directorial
equivalence filter” and eliminates all vectors which are roughly
equivalent to their predecessor and thus contribute to the
characteristic of a gesture only weakly. Vectors are omitted if none of
their components $c \in \{x, y, z\}$ differs much from the
corresponding component of their predecessor, i.e. if
$|a_c^{(n)} - a_c^{(n-1)}| < \epsilon$ for all c. For the Wiimote,
$\epsilon$ was chosen to be 0.2.


As Figure 5 shows, this filter would ideally lead to just four 
characteristic acceleration vectors in the case of the gesture 
“square”. In addition, Figure 6 demonstrates the reduction 
of the number of vectors for every reference gesture after 
applying both filters. 
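
The two filters can be sketched in a few lines. The sketch assumes acceleration vectors given in units of g and uses the threshold of 1.2 g and the component tolerance of 0.2 stated above. The function names are hypothetical, and taking the last vector that passed the filter as the predecessor is an assumption; comparing against the last incoming vector would be an equally plausible reading.

```python
import math
from typing import Iterable, List, Tuple

Vector = Tuple[float, float, float]


def idle_state_filter(vectors: Iterable[Vector],
                      threshold: float = 1.2) -> List[Vector]:
    """Drop every vector whose magnitude is below the threshold (in g)."""
    return [v for v in vectors
            if math.sqrt(sum(c * c for c in v)) >= threshold]


def directorial_equivalence_filter(vectors: Iterable[Vector],
                                   epsilon: float = 0.2) -> List[Vector]:
    """Drop vectors whose x, y and z all lie within epsilon of the predecessor."""
    kept: List[Vector] = []
    for v in vectors:
        # Assumption: the predecessor is the last vector that was kept.
        if kept and all(abs(a - b) < epsilon for a, b in zip(v, kept[-1])):
            continue
        kept.append(v)
    return kept
```

Applied in sequence, the first filter removes samples recorded while the controller is nearly at rest, and the second collapses runs of similar samples into a single representative.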


(a) Before filtering (b) After filtering

Figure 5. Effect of the directorial equivalence filter. Applying it
would ideally lead to just four acceleration vectors for the gesture
Square.

Figure 6. Reduction of vector data during filtering. The first bar for
each gesture represents the average number of vectors after applying
the first filter (“idle state”), the second bar the average number of
vectors after applying the second, the “directorial equivalence”
filter. As one can see, the number of vectors is heavily reduced by
this process, which leads to more reliable as well as faster
recognition results.


IMPLEMENTATION

In our prototype we use the Nintendo Wiimote wireless controller with
an integrated three-axis acceleration sensor (Analog Devices ADXL330).
Since the Wiimote is designed for human interaction with the Wii
console, it provides the ability for basic in-game gesture recognition.
As it is connected via the Bluetooth Human Interface Device (HID)
protocol, it is possible to read out its self-description data. The
meaning of the communicated data has been reverse engineered by the
open-source community.¹ Based on these findings it is possible to
establish basic communication with the Wiimote.


We implemented the gesture recognition in Java using the Java APIs for
Bluetooth Wireless Technology (JABWT) standardized in the JSR-82
specification. Using Java ensures platform independence; for
development and testing we use the GNU/Linux platform with the Avetana
Bluetooth implementation.²


The recognition process is realized as a reusable and extensible
gesture recognition library based on an event-driven design pattern.
The library provides an interface for basic functions, e.g.
acceleration readout with the WiiListener interface, as well as
recognition functions using a GestureListener interface. Through its
modularity it is easy to adapt our prototype to other
acceleration-based controllers. We intend to make the library available
to other researchers in the field.


EVALUATION 

In order to determine the performance of our system we conducted an
evaluation. We collected quantitative data to determine the percentage
of correctly recognized gestures for gestures trained by the users
themselves. In order to make the results comparable among the
individual participants, the five gestures described in Figure 3 were
used by all participants. The group consisted of one woman and five men
aged between 19 and 32 years. All participants had some minor
experience with the Wiimote and none used the Wiimote regularly. None
of the participants was physically disabled.


In preparation for the evaluation we set up our environment and the
Bluetooth connection to the Wiimote. The participants got a brief
explanation of the purpose of the system and of how to interact with
the Wiimote. Afterwards we introduced the five gestures using drawings
(see Figure 3) and demonstrated the execution of the first gesture,
Square. Each participant was asked to perform each gesture fifteen
times, resulting in 75 gestures per participant. The participants had
to push and hold the A button on the Wiimote while performing a
gesture. After completing the fifteen repetitions of a gesture, the
user had to press the Wiimote's HOME button and the drawing of the next
gesture was shown. Each session lasted fifteen minutes on average, and
the participants received no feedback from the system. During the
evaluation we stored the complete raw data transmitted by the Wiimote.


¹ E.g., www.wiili.org
² www.avetana-gmbh.de/avetana-gmbh/produkte/jsr82.eng.xml
Figure 7. Participant during the evaluation of the gesture recognition.


To analyze the results we trained the gesture recognition system with
the collected data. The system was trained using the leave-one-out
method to make sure that the models were evaluated on sequences that
were not used for training. That is, for each participant fifteen
training sets, each containing the five gestures, were computed. Each
training set was used to recognize the five gesture samples that had
been left out of it. The average rate of correctly recognized gestures
was 90 percent. The averaged recognition rate for each of the five
gestures is shown in Figure 8. The averaged recognition rate for each
of the six participants is shown in Figure 9.

Figure 8. Average recognition rate of the five gestures. The results
for the five gestures were Square = 88.8%, Circle = 86.6%, Roll =
84.3%, Z = 94.3%, and Tennis = 94.5%.
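
The leave-one-out procedure can be sketched as follows. The sketch assumes that in each fold one repetition of every gesture is held out and the models are trained on the remaining repetitions. The function name and the `train` and `recognize` callbacks are hypothetical placeholders for the actual training and recognition routines.

```python
from typing import Callable, Dict, List, Sequence

Samples = Dict[str, List[Sequence[int]]]


def leave_one_out_rate(samples: Samples,
                       train: Callable[[Samples], object],
                       recognize: Callable[[object, Sequence[int]], str]) -> float:
    """Fraction of held-out samples that are recognized correctly.

    samples maps a gesture name to its recorded repetitions; every
    gesture must have the same number of repetitions. In fold r,
    repetition r of every gesture is held out and the models are
    trained on all other repetitions.
    """
    repetitions = len(next(iter(samples.values())))
    correct = 0
    total = 0
    for r in range(repetitions):
        training = {g: reps[:r] + reps[r + 1:] for g, reps in samples.items()}
        model = train(training)
        for gesture, reps in samples.items():
            if recognize(model, reps[r]) == gesture:
                correct += 1
            total += 1
    return correct / total
```

With fifteen repetitions of five gestures this yields fifteen folds with five held-out samples each, i.e. 75 recognition attempts per participant. As a consistency check on the reported figures, the five per-gesture rates average to (88.8 + 86.6 + 84.3 + 94.3 + 94.5) / 5 = 89.7 percent, in line with the overall rate of about 90 percent.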


CONCLUSION 

Developing new intelligent user interfaces involves experimentation and
testing of new devices for interaction tasks. In our research, we are
working in the field of multimodal user interfaces including visual,
acoustic and haptic I/O. Based on the Wiimote we developed a gesture
recognition system that employs state-of-the-art recognition
methodology such as HMMs, filters and classifiers, and aims to optimize
hand gesture recognition for the Wiimote. As the gestures can be chosen
by the user, the system is not limited to predefined gestures but
allows each user to train and use individual gestures for personalized
interaction. To be able to measure recognition results we trained and
evaluated the system based on a set of reference gestures taken to be
relevant for different tasks such as gaming, drawing or browsing. The
recognition results vary between 85 and 95 percent, which is promising
but leaves room for further optimization of the model and filters. We
make the implementation of the gesture recognition library publicly
available, and as the Wiimote is a low-cost device we invite other
researchers to extend it and share their experiences.

Figure 9. Average recognition rate of the six participants. The results
for the six participants were 84.0%, 87.8%, 87.8%, 92.0%, 93.4%, and
93.4%.